Chinese Word Segmentation Accuracy and Its Effects on Information Retrieval

Authors

  • Schubert Foo
  • Hui Li
Abstract

In Chinese information retrieval (IR), word segmentation is an essential prerequisite that breaks documents down into smaller linguistic units, or word segments, so that they can be indexed for subsequent retrieval. Although many Chinese information systems exist today, little work has been done to study word segmentation accuracy and its effect on IR. This article describes a set of experiments conducted in the Division of Information Studies, Nanyang Technological University, Singapore, to explore this issue. Four automatic character-based segmentation approaches and one manual segmentation were used to index a test corpus, resulting in five different indices for use in the IR experiments. The segmentation accuracy of each approach was obtained by comparing the automatic segmentation results with the manual segmentation results. A set of IR experiments provided a measure of IR effectiveness using the traditional measures of recall and precision. Statistical analysis was applied to explore the correlation between segmentation accuracy and IR effectiveness. The analysis suggests that the word segmentation approach does affect IR results. In particular, an approach that recognizes a higher number of correct words of two or more characters can produce better precision and recall. On the other hand, ambiguous words resulting from the word segmentation process adversely affect precision.
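The abstract does not say exactly how segmentation accuracy was computed, so the following is only a minimal sketch of one common way to compare an automatic segmentation against a manual one: each word is mapped to its character span, and a word counts as correct when the same span appears in the manual segmentation. The function names (`spans`, `segmentation_scores`) and the sample sentence are assumptions made for illustration, not the authors' method or data.

```python
def spans(words):
    """Map a list of words to the set of (start, end) character offsets they cover."""
    result, position = set(), 0
    for word in words:
        result.add((position, position + len(word)))
        position += len(word)
    return result


def segmentation_scores(auto_words, manual_words):
    """Return (precision, recall) of an automatic segmentation against a manual one.

    Hypothetical helper: a word is counted as correct only if its exact
    character span also occurs in the manual segmentation.
    """
    if "".join(auto_words) != "".join(manual_words):
        raise ValueError("both segmentations must cover the same text")
    auto, gold = spans(auto_words), spans(manual_words)
    correct = len(auto & gold)
    precision = correct / len(auto) if auto else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Illustrative example only: the manual segmentation has three two-character
# words, while the automatic one wrongly splits the middle word in two.
manual = ["信息", "检索", "系统"]
auto = ["信息", "检", "索", "系统"]
precision, recall = segmentation_scores(auto, manual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four automatically produced segments match the manual ones, so the over-split word lowers both scores. Per-approach scores of this kind could then be set against retrieval precision and recall to look for a correlation.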

Similar resources

Investigating the Relationship between Word Segmentation Performance and Retrieval Performance in Chinese IR

It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. In this paper we show that, for Chinese, the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in...

Evaluation of Stop Word Lists in Chinese Language

In modern information retrieval systems, effective indexing can be achieved by removing stop words. Many stop word lists have been developed for the English language, but no standard stop word list has yet been constructed for the Chinese language. With the fast development of information retrieval in Chinese language, exploring the evaluation of Chinese stop word lists becomes critical...

Covering Ambiguity Resolution in Chinese Word Segmentation Based on Contextual Information

Covering ambiguity is one of the two basic types of ambiguities in Chinese word segmentation. We regard its resolution as equivalent to word sense disambiguation, and make use of the classical vector space model in information retrieval to formulate the contexts of ambiguous words. A variation form of TFIDF weighting is proposed and a Chinese thesaurus is additionally utilized to cope with data...

Chinese Word Segmentation and Information Retrieval

In this paper we present results of experiments with Chinese word segmentation and information retrieval. Our experiments with three different word segmentation algorithms indicate that accurate segmentation measurably improves retrieval performance. We discuss the evaluation of word segmentation algorithms for the purpose of better indexing segmented texts for retrieval.

Chinese word segmentation and its effect on information retrieval

A set of IR experiments was carried out at the Division of Information Studies, Nanyang Technological University, Singapore, to study Chinese word segmentation and its effect on information retrieval (IR). A total of four automatic character-based segmentation approaches and a manual word segmentation approach were first carried out to obtain the word segments for indexing and to ev...

Publication date: 2004